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failure is also referred to as an "event" . Survival analysis tries to model time-to-event 
data, which is usually subject to censoring due to the termination of study. The 
main goal is to study the dependence of the survival time T on covariate variables 
X = (Xi , X2, • • • , X p ) T , where p denotes the dimensionality of the covariate space. 
One common way of achieving this goal is hazard regression, which studies how the 
conditional hazard function of T depends on the covariate X = x, which is defined 
as 

h(t\x) = lim —P\t <T <t + At\T > t, X = x}. 
Atj.0 At 

According to the definition, the conditional hazard function is nothing but the 
instantaneous rate of failure at time t given a particular value x for the covariate 
X. The proportional hazards model is very popular, partially due to its simplicity 
and its convenience in dealing with censoring. The model assumes that 

h(t\x) = fto(t)*(x), 

in which ho(t) is the baseline hazard function and ^(x) is the covariate effect. Note 
that this model is not uniquely determined in that cho(t) and ^(x)/c give the same 
model for any c > 0. Thus one identifiability condition needs to be specified. When 
the identifiability condition \&(0) = 1 is enforced, the function ho(t), the conditional 
hazard function of T given X = 0, is called the baseline hazard function. 

By taking the reparametrization ^(x) = e^W, Cox (1972, 1975) introduced the 
proportional hazards model 

h(t\x) = h (t)e^ x \ 

Sec Klein and Moeschbcrgcr (2005) and references therein for more detailed litera- 
ture on Cox's proportional hazards model. 

Here the baseline hazard function ho(t) is typically completely unspecified and 
needs to be estimated nonparametrically. A linear model assumption, ?/»(x) = x T /3, 
may be made, as is done in this paper. Here (3 = (fa, fa, ■ ■ ■ , f3 p ) T is the regression 
parameter vector. While conducting survival analysis, we not only need to estimate 
(3 but also have to estimate the baseline hazard function ho(t) nonparametrically. 
Interested readers may consult Klein and Moeschberger (2005) for more details. 

Recent technological advances have made it possible to collect a huge amount 
of covariate information such as microarray, proteomic and SNP data via bioimag- 
ing technology while observing survival information on patients in clinical studies. 
However it is quite likely that not all available covariates are associated with the 
clinical outcome such as the survival time. In fact, typically a small fraction of 
covaraites are associated with the clinical time. This is the notion of sparsity and 
consequently calls for the identification of important risk factors and at the same 
time quantifying their risk contributions when we analyze time-to-event data with 
many predictors. Mathematically, it means that we need to identify which fas are 
nonzero and also estimate these nonzero /3jS. 

Most classical model selection techniques have been extended from linear regres- 
sion to survival analysis. They include the best-subset selection, stepwise selection, 
bootstrap procedures (Sauerbrei and Schumacher, 1992), Bayesian variable selec- 
tion (Faraggi and Simon, 1998; Ibrahim et al., 1999). Please see references therein. 
Similarly, other more modern penalization approaches have been extended as well. 
Tibshirani (1997) applied the LASSO penalty to survival analysis. Fan and Li 
(2002) considered survival analysis with the SCAD and other folded concave penal- 
ties. Zhang and Lu (2007) proposed the adaptive LASSO penalty while studying 
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time-to-event data. Among many other considerations is Li and Dicker (2009). 
Available theory and empirical results show that these penalization approaches 
work well with a moderate number of covariates. 

Recently we have seen a surge of interest in variable selection with an ultra-high 
dimensionality. By ultra-high, Fan and Lv (2008) meant that the dimensionality 
grows exponentially in the sample size, i.e., log(p) = 0{n a ) for some a £ (0, 1/2). 
For ultra-high dimensional linear regression, Fan and Lv (2008) proposed sure in- 
dependence screening (SIS) based on marginal correlation ranking. Asymptotic the- 
ory is proved to show that, with high probability, SIS keeps all important predictor 
variables with vanishing false selection rate. An important extension, iterative SIS 
(ISIS), was also proposed to handle difficult cases such as when some important 
predictors are marginally uncorrelated with the response. In order to deal with 
more complex real data, Fan, Samworth and Wu (2009) extended SIS and ISIS to 
more general loss based models such as generalized linear models, robust regres- 
sion, and classification and improved some important steps of the original ISIS. In 
particular, they proposed the concept of conditional marginal regression and a new 
variant of the method based on splitting samples. A non-asymptotic theoretical 
result shows that the splitting based new variant can reduce false discovery rate. 
Although the extension of Fan, Samworth and Wu (2009) covers a wide range of 
statistical models, it has not been explored whether the iterative sure independence 
screening method can be extended to hazard regression with censoring event time. 
In this work, we will focus on Cox's proportional hazards model and extend SIS 
and ISIS accordingly. Other extensions of SIS include Fan and Song (2010) and 
Fan, Feng and Song (2010) to generalized linear models and nonparametric addi- 
tive models, in which new insights are provided via elegant mathematical results 
and carefully designed simulation studies. 

The rest of the article is organized as follows. Section 2 details the Cox's pro- 
portional hazards model. An overview of variable selection via penalized approach 
is given in Section 3 for Cox's proportional hazards model. We extend the SIS and 
ISIS procedures to Cox's model in Section 4. Simulation results in Section 5 and 
real data analysis in Section 6 demonstrate the effectiveness of the proposed SIS 
and ISIS methods. 

2. Cox's proportional hazards models 

Let T, C, and X denote the survival time, the censoring time, and their associated 
covariates, respectively. Correspondingly, denote by Y = min{T, C} the observed 
time and 5 — I(T < C) the censoring indicator. For simplicity we assume that T and 
C are conditionally independent given X and that the censoring mechanism is non- 
informative. Our observed data set {(xi,yi,Si) : x; £ R p ,yi G M + ,5i 6 {0,1}, i = 
1, 2, • • • , n} is an independently and identically distributed random sample from a 
certain population (X, Y, S). Define C = {i : Si = 0} and U = {i : Si = 1} to be the 
censored and uncensorcd index sets, respectively. Then the complete likelihood of 
the observed data set is given by 

n 

ieu tec ieu i=i 

where f(t\x), F(t\x) = J t °° /(s|x)ds, and h(t\x) = f(t\x)/F(t\x.) arc the conditional 
density function, the conditional survival function, and the conditional hazard func- 
tion of T given X = x, respectively. 
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Let t\ < < ■ ■ ■ < t 1 ^ be the ordered distinct observed failure times. Let (j) 
index its associate covariates X(j\ and lZ(t) be the risk set right before the time t: 
lZ(t) = {i : yi > t}. Consider the proportional hazards model, 

(2.1) /i(i|x) = Vt)cxp(x T /3), 

where ho(t) is the baseline hazard function. In this model, both ho(t) and (3 arc 
unknown and have to be estimated. Under model (2.1), the likelihood becomes 

JV n 

L = JJ/i (y( J ))exp(x^. ) /3) JJexp{-F (t/i)exp(xf/3)}, 

j=l i=l 

where Ho(t) = J* ho(s)ds is the corresponding cumulative baseline hazard function. 

Following Breslow's idea, consider the "least informative" nonparametric mod- 
cling for Hq(-), in which Ho(t) has a possible jump hj at the observed failure time 
t°j, namely, H (t) = J^f=i < Then 

(2.2) H {y i ) = Y,h j I(ieK(t j )). 
Consequently the log-likelihood becomes 

JV n JV 

^{log(^) + x^/3} - J2iJ2 h ^ e n ^ ex p( x ^)}- 

j=l i=l j=l 

Maximizcr hj is given by 

(2.3) h j {/3) = { J2 cxp(xf/3)}- 1 . 

ien(t°) 

Putting this maximizer back to the log-likclihood, we get 

JV 

^[x^/3-log{ ]T cxp(xf/3)}], 

which is equivalent to 

n n 

(2.4) ^(/3)=^xf/3-^log{ £ exp(xJ/3)}. 

i=l i=l j£%) 

by using the censoring indicator 5,-. This is the partial likelihood due to Cox (1975). 

Maximizing £(/3) in (2.4) with respect to (3, we can get an estimate f3 of the 
regression parameter. Once (3 is available, we may plug it into (2.3) to get hj(j3). 
These newly obtained hj((3)s can be plugged into (2.2) to obtain our nonparametric 
estimate of the baseline cumulative hazard function. 

3. Variable selection for Cox's proportional hazards model via 
penalization 

In the estimation scheme presented in the previous section, none of the estimated 
regression coefficients is exactly zero, leaving all covariates in the final model. Con- 
sequently it is incapable of selecting important variables and handling the case 
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with p > n. To achieve variable selection, classical techniques such as the best- 
subset selection, stepwise selection, and bootstrap procedures have been extended 
accordingly to handle Cox's proportional hazards model. 

In this section, we will focus on some more advanced techniques for variable 
selection via penalization. Variable selection via penalization has received lots of 
attention recently. Basically it uses some variable selection-capable penalty function 
to regularize the objective function while performing optimization. Many variable 
selection-capable penalty functions have been proposed. A well known example is 
the L\ penalty A which is also known as the LASSO penalty (Tibshirani, 

1996). Among many others are the SCAD penalty (Fan and Li, 2001), elastic- net 
penalty (Zou and Hastie, 2005), adaptive L\ (Zhang and Lu, 2007; Zou, 2006), and 
minimax concave penalty (Zhang, 2009). 

Denote a general penalty function by p\(-), where A > is a regularization 
parameter. From derivations in the last section, penalized likelihood is equivalent 
to penalized partial likelihood: While maximizing £(f3) in (2.4), one may regularize 
it using ^2^ =1 P\(Pj)- Equivalcntly we solve 

p 
i=i 

by including a negative sign in front of £((3). In the literature, Tibshirani (1997), 
Fan and Li (2002), and Zhang and Lu (2007) considered the L\, SCAD, and adap- 
tive L\ penalties while studying time-to-event data, respectively, among many oth- 
ers. 

In this paper, we will use the SCAD penalty for our extensions of SIS and ISIS 
whenever necessary. The SCAD function is a quadratic spline and symmetric around 
the origin. It can be defined in terms of its first order derivative 

v'M) = \{i m <» + (a ( A a :f)^ i>*}}' 

for some a > 2 and j3 ^ 0. Here a is a parameter and Fan and Li (2001) recommend 
to use a = 3.7 based on a Bayesian argument. The SCAD penalty is plotted in 
Figure 1 for a = 3.7 and A = 2. The SCAD penalty is non-convex, leading to non- 
convex optimization. For the non-convex SCAD penalized optimization, Fan and Li 
(2001) proposed the local quadratic approximation; Zou and Li (2008) proposed 
the local linear approximation; Wu and Liu (2009) presented the difference convex 
algorithm. In this work, whenever necessary we use the local linear approximation 
algorithm to solve the SCAD penalized optimization. 

4. SIS and ISIS for Cox's proportional hazard model 

The penalty based variable selection techniques work great with a moderate num- 
ber of covariates. However their usefulness is limited while dealing with an ultra- 
high dimensionality as shown in Fan and Lv (2008). In the linear regression case, 
Fan and Lv (2008) proposed to rank covariates according to the absolute value of 
their marginal correlation with the response variable and select the top ranked co- 
variates. They provided theoretical result to guarantee that this simple correlation 
ranking retains all important covariates with high probability. Thus they named 
their method sure independence screening (SIS). In order to handle difficult prob- 
lems such as the one with some important covariates being marginally uncorrelated 
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Fig 1. PZoi o/ t/ie SCAD penalty with a = 3.7 and A = 2 



with the response, they proposed iterative SIS (ISIS). ISIS begins with SIS, then it 
regresses the response on covariates selected by SIS and uses the regression residual 
as a "working" response to recruit more covariates with SIS. This process can be 
repeated until some convergence criterion has been met. Empirical improvement 
over SIS has been observed for ISIS. In order to increase the power of the sure in- 
dependence screening technique, Fan, Samworth and Wu (2009) has extended SIS 
and ISIS to more general models such as generalized linear models, robust regres- 
sion, and classification and made several important improvements. We now extend 
the key idea of SIS and ISIS to handle Cox's proportional hazards model. 

Let A4* be the index set of the true underlying sparse model, namely, M* = 
{j : (i* 7^ and 1 < j < p}, where (3*s arc the true regression coefficients in the 
Cox's proportional hazards model (2.1). 

4-1- Ranking by marginal utility 

First let us review the definition of sure screening property. 

Definition 1 (Sure Screening Property). We say a model selection procedure sat- 
isfies sure screening property if the selected model M. with model size o p (n) includes 
the true model Ai* with probability tending to one. 

For each covariatc X m (1 < m < p), define its marginal utility as the maximum 
of the partial likelihood of the single covariate: 

(n n \ 

^ $i%imPm ~ ^ 8j log{ ^2 ex P( x jrnPm)} J ■ 

Here Xi m is the m-th element of Xj, namely x, = (xn,Xi2, ■ ■ ■ ,Xi P ) T . Intuitively 
speaking, the larger the marginal utility is, the more information the corresponding 
covariate contains the information about the survival outcome. Once we have ob- 
tained all marginal utilities u m for m = 1, 2, • • • ,p, we rank all covariates according 
to their corresponding marginal utilities from the largest to the smallest and select 
the d top ranked covariates. Denote by X the index set of these d covariates that 
have been selected. 
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The index set X is expected to cover the true index set M* with a high probabil- 
ity, especially when we use a relative large d. This is formally shown by Fan and Lv 
(2008) for the linear model with Gaussian noise and Gaussian covariates and sig- 
nificantly expanded by Fan and Song (2010) to generalized linear models with non- 
Gaussian covariates. The parameter d is usually chosen large enough to ensure 
the sure screening property. However the estimated index set X may also include 
a lot of unimportant covariates. To improve performance, the penalization based 
variable selection approach can be applied to the selected subset of the variables 
{Xj , j G T) to further delete unimportant variables. Mathematically, we then solve 
the following penalized partial likelihood problem: 

(n n 
- E + Y 5i l0g { E eX P( X X,i^x)} + E PxWm) 

i=l i=l j£R(i/i) rael 

where xj.j denotes a sub- vector of Xj with indices in I and similarly for (3 X - It will 
lead to sparse regression parameter estimate j3 x - Denote the index set of nonzero 
components of f3 x by A4, which will serve as our final estimate of A4*. 



4-2. Conditional feature ranking and iterative feature selection 

Fan and Lv (2008) pointed out that SIS can fail badly for some challenging scenarios 
such as the case that there exist jointly related but marginally unrelated covari- 
ates or jointly uncorrelatcd covariates having higher marginal correlation with the 
response than some important predictors. To deal with such difficult scenarios, it- 
erative SIS (ISIS) has been proposed. Comparing to SIS which is based on marginal 
information only, ISIS tries to make more use of joint covariates' information. 

The iterative SIS begins with using SIS to select an index set X±, upon which 
a penalization based variable selection step is applied to get regression parameter 
estimate Xi . A refined estimate of the true index set is obtained and denoted by 
M. i , the index set corresponding to nonzero elements of 0% . 

As in Fan, Samworth and Wu (2009), we next define the conditional utility of 
each covariate m that is not in A^i as follows: 

u m \M 1 = m ax ^^(lX + x^/^) 

Pm.,P Ml \ i = l 

\ 

»=i jen( yi ) J 

This conditional utility measures the additional contribution of the mth covariate 
given that all covariates with indices in Ai\ have been included in the model. 

Once the conditional utilities have been defined for each covariate that is not 
in M-x, we rank them from the largest to the smallest and select these covariates 
with top rankings. Denote the index set of these selected covariates by T-i- With I2 
having been identified, we minimize 

n n 

(4.1) -E^AW.AiUzJ + E^ 10 ^ E ex P( X Al 1 ui 2 ,/Al 1 ui 2 )} 
t=l i=l j£Tl(.yi) 

+ E 

m£M 1 UX 2 
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with respect to 0j^ lU ± 2 to get sparse estimate 0jQi lU ± 2 - Denote the index set cor- 
responding to nonzero components of 0^ u ± ^° ^ e -M-2, which is our updated 
estimate of the true index set Ai* . Note that this step can delete some variables 
{Xj 6 Aii} that were previously selected. This idea was proposed in Fan et al. 
(2009) and is an improvement of the idea in Fan and Lv (2008). 

The above iteration can be repeated until some convergence criterion is reached. 
We adopt the criterion of either having identified d covariates or Aij = Aij-\ for 
some j. 

4-3. New variants of SIS and ISIS for reducing FSR 

Fan, Samworth and Wu (2009) noted that the idea of sample spliting can also be 
used to reduce the false selection rate. Without loss of generality, we assume that 
the sample size n is even. We randomly split the sample into two halves. Then 
apply SIS or ISIS separately to the data in each partition to obtain two estimates 
IS 1 ' and XS 1 ^ 1 of the true index set Ai* . Both these two estimates could have high 
FSRs because they are based on a simple and crude screening method. Yet each 
of them should include all important covariates with high probabilities. Namely, 
important covariates should appear in both sets with probability tending to one 
asymptotically. Define a new estimate by intersection I = XS^ fll' 2 '. The new 
estimate X should include all important covaraites with high probability as well due 
to properties of each individual estimate. However by construction, the number of 
unimportant covariates in the new estimate X is much smaller. The reason is that, 
in order for an unimportant covariate to appear in I, it has to be included in both 
IS X > and iS 1 ^ randomly. 

For the new variant method based on random splitting, Fan, Samworth and Wu 
(2009) obtained some non-asymptotic probability bound for the event that r unim- 
portant covaraites are included in the intersection X for any natural number r un- 
der some exchangeability condition on all unimportant covariates. The probability 
bound is decreasing in the dimensionality, showing a "blessing of dimensionality" . 
Please consult Fan, Samworth and Wu (2009) for more details. We want to remark 
that their theoretical bound is applicable to our setting as well while studying time- 
to-event data because theoretical bound is based on splitting the sample into two 
halves and only requires the independence between these two halves. 

While defining new variants, we may use the same d as used in the original 
SIS and ISIS. However it will lead to a very aggressive screening. We call the 
corresponding variant the first variant of (I)SIS. Alternatively, in each step we 
may choose larger X^ and XS 1 ^ to ensure that their intersection iS* 1 ' n X^ 2 ' has 
d covariates, which is called the second variant. The second variant ensures that 
there are at least d covariates included before applying penalization in each step 
and is thus much less aggressive. Numerical examples will be used to explore their 
performance and prefer to the first variant. 

5. Simulation 

5.1. Design of simulations 

In this section, we conduct simulation studies to show the power of the (I) SIS and 
its variants by comparing them with LASSO (Tibshirani, 1997) in the Cox's pro- 
portional hazards model. Here the regularization parameter for LASSO is tuned 
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via five fold cross validation. Most of the settings are adapted from Fan and Lv 
(2008) and Fan, Samworth and Wu (2009). Four different configurations are con- 
sidered with n = 300 and p = 400. And two of them are revisited with a different 
pair of sample size n = 400 and dimensionality p = 1000. Covariates in different 
settings are generated as follows. 

Case 1: X\,--- ,X p are independent and identically distributed ./V(0, 1) random 
variables. 

Case 2: X\, ■ • • , X p are multivariate Gaussian, marginally iV(0, 1), and with serial 
correlation corr(Xj, X,) = p if i ^ j. Here we take p = 0.5. 

Case 3: X\, • • • , X p are multivariate Gaussian, marginally N(0, 1), and with corre- 
lation structure corr(X;,^4) = l/\/2 for all i ^ 4 and corr(X,,X,) = 1/2 if 
i and j are distinct elements of {1, • • • ,p}\{4}. 

Case 4: X\, ■ ■ ■ ,X P are multivariate Gaussian, marginally iV(0, 1), and with cor- 
relation structure corr(Xi, X$) = for all i =^ 5, coTi(Xi, X4) = 1/V2 for 
all i ^ {4,5}, and corr(Ai,X,) = 1/2 if i and j are distinct elements of 
{l,...,p}\{4,5}. 

Case 5: Same as Case 2 except n = 400 and p = 1000. 

Case 6: Same as Case 4 except n — 400 and p = 1000. 

Here, Case 1 with independent predictors is the most straightforward for variable 
selection. In Cases 2-6, however, we have serial correlation such that corr(X;, Xj) 
does not decay as \i — j\ increases. We will see later that for Cases 3, 4 and 6, the 
true coefficients are specially chosen such that the response is marginally indepen- 
dent but jointly dependent of X4. We therefore expect variable selection in these 
situations to be much more challenging, especially for the non-iterated versions of 
SIS. Notice that in the asymptotic theory of SIS in Fan and Lv (2008), this type of 
dependence is ruled out by their Condition (4). 

In our implementation, we choose d = [ 4 1 - j for both the vanilla version of 
SIS (Van-SIS) and the second variant (Var2-SIS). For the first variant (Varl-SIS), 
however, we use d = Lk^J (note that since the selected variables for the first 
variant are in the intersection of two sets of size d, we typically end up with far 
fewer than d variables selected by this method). For any type of SIS or ISIS, we 
apply SCAD with these selected predictors to get a final estimate of the regression 
coefficients at the end of the screening step. Whenever necessary, the BIC is used 
to select the best tuning parameter in the rcgularization framework. 

In all setting, the censoring time is generated from exponential distribution with 
mean 10. This corresponds to choosing the baseline hazard function ho(t) = 0.1 for 
t > 0. The true regression coefficients and censoring rate in each of the six cases 
are as follows: 

Case 1: p 1 = -1.6328, fa = 1.3988, fa = -1.6497, fa = 1.6353, fa = -1.4209, fa = 
1.7022, and fa = for j > 6. The corresponding censoring rate is 33%. 

Case 2: The coefficients are the same as Case 1. The corresponding censoring rate 
is 27%. 

Case 3: fa =4,fa = 4, fa = 4, fa = -6\/2, and fa = for j > 4. The corresponding 

censoring rate is 30%. 
Case 4: fa = A, fa = A, fa = A, fa = -&V2, fa = 4/3 and fa = for j > 5. The 

corresponding censoring rate is 31%. 
Case 5: fa = -1.5140, fa = 1.2799, fa = -1.5307, fa = 1.5164, fa = -1.3020, As = 

1.5833, and fa = for j > 6. The corresponding censoring rate is 23%. 
Case 6: The coefficients are the same as Case 4. The corresponding censoring rate 
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is 36%. 

In Cases 1, 2 and 5 the coefficients were chosen randomly, and were generated as 
{A\ogn/«Jn+ \Z\/A)U with Z ~ N(0, 1) and U = 1 with probability 0.5 and -1 with 
probability 0.5, independent of Z. For Cases 3, 4, and 6, the choices ensure that 
even though (3^ ^ 0, we have that X4 and Y are marginally independent. The fact 
that Xi is marginally independent of the response is designed to make it difficult 
for the common independent learning to select this variable. In Cases 4 and 6, we 
add another important variable X§ with a small coefficient to make it even more 
difficult. 

5.2. Results of simulations 

We report our simulation results based on 100 Monte Carlo repetitions for each 
setting in Tables 1-7. To present our simulation results, we use several different 
performance measures. In the rows labeled ||/3 — (3\\i and \\(3 — j3\\\ : we report the 
median L\ and squared L2 estimation errors ||/3 — (3\\i = Y^=i \Pj ~ Pj\ an d ||/3 — 
= \Pj ~ fij\ 2 i respectively, where the median is over the 100 repetitions. 

In the row with label P%, we report the proportion of the 100 repetitions that the 
(I) SIS procedure under consideration includes all of the important variables in the 
model, while the row with label Pi reports the corresponding proportion of times 
that the final variables selected, after further application of the SCAD penalty, 
include all of the important ones. We also report the median model size (MMS) of 
the final model among 100 repetitions in the row labeled MMS. 

Table 1 

Results for Cases 1 and 2. Here P\ stands for the probability that (I)SIS includes the true 
model, i.e., has the sure screening property. P2 stands for the probability that the final model 
has the sure screening property. MMS stands for Median Model Size among 100 repetitions. The 
sample size n = 300 and the number of covariates is p = 400. 



Case 1: independent covariates 





Van-SIS 


Van-ISIS 


Varl-SIS 


Varl-ISIS 


Var2-SIS 


Var2-ISIS 


LASSO 




0.79 


0.57 


0.73 


0.61 


0.76 


0.62 


4.23 


11/3 


0.13 


0.09 


0.15 


0.1 


0.15 


0.1 


0.98 


Pi 


1 


1 


0.99 


1 


0.99 


1 




P2 


1 


1 


0.99 


1 


0.99 


1 


1 


MMS 


7 


6 


6 


6 


6 


6 


68.5 




Case 2: Equi-correlated covariates 


with p = 0.5 






Van-SIS 


Van-ISIS 


Varl-SIS 


Varl-ISIS 


Var2-SIS 


Var2-ISIS 


LASSO 


WP-Mt 


2.2 


0.64 


4.22 


0.8 


3.95 


0.78 


4.38 


WP-Ml 


1.74 


0.11 


4.71 


0.29 


4.07 


0.28 


0.98 


Pi 


0.71 


1 


0.42 


0.99 


0.46 


0.99 




P2 


0.71 


1 


0.42 


0.99 


0.46 


0.99 


1 


MMS 


7 


6 


6 


6 


7 


6 


57 



We report results of Cases 1 and 2 in Table 1. Recall that the covariates in 
Case 1 are all independent. In this case, the Van-SIS performs reasonably well. Yet, 
it does not perform well for the dependent case, Case 2. Note the only difference 
between Case 1 and Case 2 is the covariance structure of the covariates. For both 
cases, vanilla-ISIS and its second variant perform very well. It is worth noticing 
that the ISIS improves significantly over SIS, when covariates are dependent, in 
terms of both the probability of including all the true variables and in reducing the 
estimation 
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error. This comparison indicates that the ISIS performs much better when there 
is serious correlation among covariates. 

While implementing the LASSO penalized Cox's proportional hazards model, 
we adapted the Fortran source code in the R package "glmpath." Recall that the 
objective function in the LASSO penalized Cox's proportional hazards model is 
convex and nonlinear. What the Fortran code does is to call a MINOS subroutine to 
solve the corresponding nonlinear convex optimization problem. Here MINOS is an 
optimization software developed by Systems Optimization Laboratory at Stanford 
University. This nonlinear convex optimization problem is much more complicated 
than a general quadratic programming problem. Thus generally it takes much longer 
time to solve, especially so when the dimensionality is high as confirmed by Table 
3. However the algorithm we used does converge as the objective function is strictly 
convex. 

Table 1 shows that LASSO has the sure screening property as the ISIS, however, 
the median model size is ten times as large as that of ISIS. As a consequence, it 
also has larger estimation errors in terms of \\(3 — (3\\i and ||/3 — /3|||. The fact 
that the median absolute deviation error is much larger than the median square 
error indicates that the LASSO selects many small nonzero coefficients for those 
unimportant variables. This is also verified by the fact that LASSO has a very large 
median model size. The explanation is the bias issue noted by Fan and Li (2001). 
In order for LASSO to have a small bias for nonzero coefficients, a smaller A should 
be chosen. Yet, a small A recruits many small coefficients for unimportant variables. 
For Case 2, the LASSO has a similar performance as in Case 1 in that it includes 
all the important variables but has a much larger model size. 

Table 2 

Results for Cases 3 and 4- The same caption as Table 1 is used. 



Case 3: An important predictor that is independent of survival time 





Van-SIS 


Van-ISIS 


Varl-SIS 


Varl-ISIS 


Var2-SIS 


Var2-ISIS 


LASSO 


11/3 


20.1 


1.03 


20.01 


0.99 


20.09 


1.08 


20.53 


11/3 


94.72 


0.49 


100.42 


0.47 


94.77 


0.55 


76.31 


Pi 





1 





1 





1 




Pi 





1 





1 





1 


0.06 


MMS 


13 


4 


8 


4 


13 


4 


118.5 




Case 4: Two 


very hard variables to be selected. 






Van-SIS 


Van-ISIS 


Varl-SIS 


Varl-ISIS 


Var2-SIS 


Var2-ISIS 


LASSO 


11/3 -/3||i 


20.87 


1.15 


20.95 


1.4 


20.96 


1.41 


21.04 


11/3 -/3||! 


96.46 


0.51 


102.14 


1.77 


97.15 


1.78 


77.03 


Pi 





1 





0.99 





0.99 




Pi 





1 





0.99 





0.99 


0.02 


MMS 


13 


5 


9 


5 


13 


5 


118 



Results of Cases 3 and 4 are reported in Table 2. Note that, in both cases, the 
design ensures that X4 is marginally independent of but jointly dependent on Y . 
This special design disables the SIS to include X4 in the corresponding identified 
model as confirmed by our numerical results. However, by using ISIS, we are able 
to select X4 for each repetition. Surprisingly, LASSO rarely includes A4 even if 
it is not a marginal screening based method. Case 5 is even more challenging. In 
addition to the same challenge as case 4, the coefficient (3$ is 3 times smaller than the 
first four variables. Through the correlation with the first 4 variables, unimportant 
variables {Xj, j > 6} have a larger marginal utility with Y than A5. Nevertheless, 
the ISIS works very well and demonstrates once more that it uses adequately the 
joint covariatc information. 
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Table 3 

The average running time (in seconds) comparison for Van- ISIS and LASSO. 





Case 1 


Case 2 


Case 3 


Case 4 


Van-ISIS 


379.29 


213.44 


402.94 


231.68 


LASSO 


37730.82 


26348.12 


46847 


28157.71 



We also compare the computational cost of van-ISIS and LASSO in Table 3 
for Cases 1-4. Table 3 shows that it takes LASSO several hours for each repetition, 
while van-ISIS can finish it in just several minutes. This is a huge improvement. For 
this reason, for Cases 5 and 6 where p = 1000, we only report the results for ISIS 
since it takes LASSO over several days to complete a single repetition. Results of 
Cases 5 and 6 are reported in Table 4. The table demonstrates similar performance 
as Cases 2 and 4 even with more covariates. 

Table 4 

Results for Cases 5 and 6. The same caption as that of Table 1 is used. 



Case 5: The same as case 2 with p = 1000 and n = 400 





Van-SIS 


Van-ISIS 


Varl-SIS 


Varl-ISIS 


Var2-SIS 


Var2-ISIS 


110-/% 


1.53 


0.52 


3.55 


0.55 


2.95 


0.51 


110-0111 


0.9 


0.07 


3.48 


0.08 


2.5 


0.07 


Pi 


0.82 


1 


0.39 


1 


0.5 


1 


Pi 


0.82 


1 


0.39 


1 


0.5 


1 


MMS 


8 


6 


6 


6 


7 


6 


Case 6: The same as 


case 4 with p = 1000 and n = 400. 




Van-SIS 


Van-ISIS 


Varl-SIS 


Varl-ISIS 


Var2-SIS 


Var2-ISIS 


110-/% 


20.88 


0.99 


20.94 


1.1 


20.94 


1.29 


11/3 -0111 


93.53 


0.39 


104.76 


0.44 


94.02 


1.35 


Pi 





1 





1 





0.99 


P 2 





1 





1 





0.99 


MMS 


16 


5 


8 


5 


16 


5 



To conclude the simulation section, we demonstrate the difficulty of our simu- 
lated models by showing the distribution, among 100 simulations, of the minimum 
|i|-statistic for the estimates of the true nonzero regression coefficients in the oracle 
model with only true important predictors included. More explicitly, during each 
repetition of each simulation setting, we pretend to know the index set M* the 
true underlying sparse model, fit the Cox's proportional hazards model using only 
predictors with indices in M* by calling function "coxph" of R package "survival", 
and report the smallest absolute value of the i-statistic for the regression estimates. 
For example, for case 1, the model size is only 6 and the minimum |i|-statistic is 
computed based on these 6 estimates for each simulation. This shows the difficulty 
to recover all significant variables even in the oracle model with the minimum model 
size. The corresponding boxplot for each case is shown in Figure 2. To demonstrate 
the effect of including unimportant variables, the minimum |i|-statistic for the es- 
timates of the true nonzero regression coefficients is recalculated and shown by 
the boxplots in Figure 3 for the model with the true important variables and 20 
unimportant variables. 

As expected, cases 1 and 2 are relatively easy cases, whereas cases 3 and 4 are 
relatively harder in the oracle model. When we are not in the oracle model with 20 
noisy variables are added, the difficulty increases as shown in Figure 3. It has more 
impact on cases 3, 4 and 6. 
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Case 1 Case 2 Case 3 Case 4 Case 5 



Case ( 



Fig 2. The boxplots of the minimum \t\-statistic in the oracle models among 100 simulations. 



6. Real data 



In this section, we use one real data set to demonstrate the power of the proposed 
method. The Neuroblastoma data set is due to Oberthuer et al. (2006). It was used 
in Fan, Samworth and Wu (2009) for classification studies. Neuroblastoma is an 
extracranial solid cancer. It is most common in childhood and even in infancy. The 
annual number of incidences is about several hundreds in the United States. Neu- 
roblastoma is a malignant pediatric tumor originating from neural crest elements 
of the sympathetic nervous system. 

The study includes 251 patients of the German Neuroblastoma Trials NB90- 
NB2004, who were diagnosed between 1989 and 2004. The patients' ages range 
from to 296 months at diagnosis with a median age of 15 months. Neuroblastoma 
specimens of these 251 patients were analyzed using a customized oligonucleotide 
microarray. The goal is to study the association of gene expression with variable 
clinical information such as survival time and 3-year event free survival, among 
others. 

We obtained the neuroblastoma data from the MicroArray Quality Control 
phase-II (MAQC-II) project conducted by the Food Drug Administration (FDA). 
The complete data set includes gene expression at 10,707 probe sites. It also in- 
cludes the survival information of each patient. In this example, we focus on the 
overall survival. There are five outlier arrays. After removing outlier arrays from our 
consideration, there arc 246 patients. The (overall) survival information is available 
all of these 246 patients. The censoring rate is 205/246, which is very heavy. The 
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Fig 3. The boxplots of the minimum \t\-statistic in the models where 20 noise variables are added 
among 100 simulations. 
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Fig 4. Estimated survivor function for 246 patients. 



survival times of those 246 patients are summarized in Figure 4. 
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Table 5 

Estimated coefficients for Neuroblastoma data 


Probe ID 


estimated coefficient 


standard error 


p- value 


A.23.P31816 


0.864 


0.203 


2.1e-05 


A.23.P31816 


-0.940 


0.314 


2.8e-03 


A.23.P31816 


-0.815 


1.704 


6.3e-01 


A_32_P424973 


-1.957 


0.396 


7.8e-07 


A_32_P159651 


-1.295 


0.185 


2.6e-12 


Hs61272.2 


1.664 


0.249 


2.3e-ll 


Hsl3208.1 


-0.789 


0.149 


l.le-07 


Hsl50167.1 


1.708 


1.687 


3.1e-01 



As real data are always complex, there may be some genes that are marginally 
unimportant but work jointly with other genes. Thus it is more appropriate to 
apply iterative SIS instead of SIS, since the former is more powerful. We stan- 
dardize each predictor to have mean zero and standard deviation 1 and apply 
van-ISIS to the standardized data with d = [n/\og(n)\ = 43. ISIS followed with 
SCAD penalized Cox regression selects 8 genes with probe site names: A_23_P31816, 
A.23.P31816, A.23.P31816, A_32_P424973, A.32_P159651, Hs61272.2, Hsl3208.1, 
and Hsl50167.1. 
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Fig 5. Estimated baseline survivor function 

Now we try to provide some understanding to the significance of these selected 
genes in predicting the survival information in comparison to other genes that are 
not selected. We first fitted the Cox's proportional hazard model with all these 
eight genes. Estimated coefficients are given in Table 5, estimated baseline survival 
function is plotted in Figure 5, and the corresponding log-(partial)likclihood (2.4) is 
-129.3517. The log-likelihood corresponding to the null model without any predictor 
is -215.4561. A x 2 test shows the obvious significance of the model with the eight 
selected genes. Table 5 shows that there are two estimated coefficients that are 
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statistically insignificant at a — 1%.. 

Next for each one of these eight genes, we remove it, fit Cox's proportional haz- 
ard model with the other seven genes, and get the corresponding log-likelihood. 
The eight log-likelihoods are -137.5785, -135.1846, -129.4621, -142.4066, -156.4644, 
-158.3799, -141.0432, and -129.8390. Their average is -141.2948, a decrease of log- 
likelihood by 11.9431, which is very significant with reduction one gene (the re- 
duction of the degree of freedom by 1). In comparison to the model with the eight 
selected genes, x 2 tests shows significance for all selected genes except A_23_P31816 
and Hsl50167.1. This matches the p-values reported in Table 5. 

Finally we randomly select 2 genes out of the genes that are not selected, fit 
the Cox's proportional hazard model with the above eight genes plus these two 
randomly selected genes, and record the corresponding log-likelihood. We repeat 
this process 20 times. We find that the average of these 20 new log-likelihoods is - 
128.3933, an increase of the log-likelihood merely by 0.9584 with two extra variables 
included. Comparing to the model with the eight selected genes, % 2 test shows no 
significance for the model corresponding to any of the 20 repetitions. 

The above experiments show that the selected 8 genes are very important. Delet- 
ing one reduces a lot of log-likelihood, while adding two random genes do not in- 
crease very much the log-likelihood. 

7. Conclusion 

We have developed a variable selection technique for the survival analysis with the 
dimensionality that can be much larger than sample size. The focus is on the iter- 
ative sure independence screening, which iterativcly applies a large-scale screening 
that filters unimportant variables by using the conditional marginal utility, and a 
moderate-scale selection by using penalized partial likelihood method, which se- 
lects further the unfiltered variables. The methodological power of the vanilla ISIS 
has been demonstrated via carefully designed simulation studies. It has sure inde- 
pendence screening with very small false selection. Comparing with the version of 
LASSO we used, it is much more computationally efficient and far more specific in 
selection important variables. As a result, it has much smaller absolute deviation 
error and mean square error. 

References 

Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal 

of the Royal Statistical Society Series B, 34 187-220. 
Cox, D. R. (1975). Partial likelihood. Biometrika, 62 269-76. 
Fan, J., Feng, Y. and Song, R. (2010). Nonparametric independence screening 

in sparse ultra-high dimensional additive models. Submitted. 
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood 

and its oracle properties. Journal of the American Statistical Association, 96 

1348-1360. 

Fan, J. and Li, R. (2002). Variable selection for cox's proportional hazards model 

and frailty model. Annals of Statistics, 30 74-99. 
Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional 

feature space (with discussion). Journal of the Royal Statistical Society Series B, 

70 849-911. 

imsart-coll ver. 2009/08/13 file: sis-survival-revlsed.tex date: May 20, 2010 



High- dimensional variable selection 



17 



Fan, J., Samworth, R. and Wu, Y. (2009). Ultrahigh dimensional variable 
selection: beyond the lienar model. Journal of Machine Learning Research. To 
appear. 

Fan, J. and Song, R. (2010). Sure independence screening in generalized linear 
models with np-dimcnsionality. The Annals of Statistics, to appear. 

Faraggi, D. and Simon, R. (1998). Bayesian variable selection method for cen- 
sored survival data. Biometrics, 54 1475-5. 

Ibrahim, J. G., Chen, M.-H. and Maceachern, S. N. (1999). Bayesian variable 
selection for proportional hazards models. Canandian Journal of Statistics, 27 
70117. 

Klein, J. P. and Moeschberger, M. L. (2005). Survival analysis. 2nd ed. 
Springer. 

Li, Y. and Dicker, L. (2009). Dantzig selector for censored linear regression. 
Technical report, Harvard University Biostatistics. 

Oberthuer, A., Berthold, F., Warnat, P., Hero, B., Kahlert, Y., Spitz, 
R., Ernestus, K., Konig, R., Haas, S., Eils, R., Schwab, M., Brors, 
B., Westermann, F. and Fischer, M. (2006). Customized oligonucleotide 
microarray gene expressionbased classification of neuroblastoma patients out- 
performs current clinical risk stratification. Journal of Clinical Oncology, 24 
5070-5078. 

Sauerbrei, W. and Schumacher, M. (1992). A bootstrap resampling proce- 
dure for model building: Application to the cox regression model. Statistics in 
Medicine, 11 2093-2109. 

Tibshirani, R. (1997). The lasso method for variable selection in the cox model. 
Statistics in Medicine, 16 385-95. 

Tibshirani, R. J. (1996). Regression shrinkage and selection via the lasso. Journal 
of the Royal Statistical Society, Series B, 58 267-288. 

Wu, Y. and Liu, Y. (2009). Variable selection in quantile regression. Statistica 
Sinica, 19 801-817. 

Zhang, C.-H. (2009). Penalized linear unbiased selection. Ann. Statist. To appear. 
Zhang, H. H. and Lu, W. (2007). Adaptive lasso for cox's proportional hazards 

model. Biometrika, 94 691-703. 
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the 

American Statistical Association, 101 1418-1429. 
Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized 

likelihood models (with discussion). Annals of Statistics, 36 1509-1566. 
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the 

clastic net. Journal of the Royal Statistical Society, Series B, 67 301-320. 



imsart-coll ver. 2009/08/13 file: sis-survival-revised.tex date: May 20, 2010 



